Information Storage Capacity of Connectionist Systems:
The Linear Associator¹

Dean  C.  Mumme 
and 

Walter  Schneider 


Learning  Research  and  Development  Center 
University  of  Pittsburgh 

Abstract 

Information theory is applied to determine the number of items storable in a linear associator. An
ensemble of association matrices is treated as an M-ary symmetric information channel, where M is the
number of associations stored via the outer-product rule. The entropy of the ensemble under the
outer-product learning rule is derived and used to bound the number of usefully-storable items for the
ensemble. In particular, if the ensemble has input dimensionality $n_I$ and output dimensionality $n_O$,
and M associations between vectors of ±1's are stored, then the entropy of each weight is
$\frac{1}{2}\log_2 2\pi e M$. Assuming independent weights gives the upper bound
$\frac{1}{2} n_I n_O \log_2 2\pi e M$ for the entropy of the ensemble. The task of the ensemble as an
M-ary symmetric channel is correct identification of which output-prototype corresponds to the
prototype presented at the input. The corresponding task entropy or task load for M stored items is
$M \log_2 M$, which leads to the upper bound

$$\frac{M}{n_I n_O} < \frac{1}{2} + \frac{2}{\log_2 M}$$

for the ratio of the number of associations storable to the number of weights in the system.
Asymptotically, large matrices can store at most half as many associations as there are weights in the
system. Storage efficiency is defined as the number of bits stored in the ensemble divided by the
number of bits needed to specify the ensemble itself. The efficiency can be shown to be less than
$M / (n_I n_O)$.

Performance degradation due to storage of correlated vectors is addressed. A performance merit
parameter, d', is derived as a function of matrix size, number of items stored, and correlation between
stored prototypes. This parameter is shown to decrease with the square root of M if the vectors are
uncorrelated; otherwise it decreases with M. This indicates a marked capacity decline in the correlated
case and reveals quantitatively the sensitivity of large systems to prototype correlation. In order for
correlation effects to be negligible, the probability p that a 1 occurs should be very nearly 1/2 as M
gets large. A sufficient condition is that |p - 1/2| < [...]. A sufficient condition for correlation
effects to be prevalent is |p - 1/2| > [...]. This reflects the sensitivity of large systems to vector
correlation.

¹ Paper is based on a thesis by the first author for the doctoral degree in Computer Science at the
University of Illinois at Urbana-Champaign.

More generally, performance limits are derived by evaluating the task-entropy and using
information-theoretical relations between input, memory, and output random variables. This has
implications for memory-classification of input vectors. This task can be viewed as a retrieval on
information-degraded inputs (e.g. retrieval on partial or noisy input vectors). Performance is limited
by the amount of information the input vector provides about the correct input prototype. The amount of
information provided by the input decreases as more classification 'fan-in' is allowed. The amount of
information the input provides and its relation to memory storage and classification can be derived
analytically in certain interesting cases.

Numerical evaluation of derived relations and simulations are included to verify the theory. The
intent of this investigation is to provide a basis for the eventual development of an information
theory of memory.


Information  Storage  and  Classification  in 
Connectionist  Systems: 

The  Linear  Associator 



Introduction 

The systems under consideration are an outgrowth of work done on self-organizing automata and
perceptrons [26, 30] and later work in parallel associative memories, e.g. [15, 31]. Minsky and Papert
[20] had carried out rather extensive mathematical analysis on perceptrons, revealing inherent
limitations in the classes of problems they could solve. These systems were 'learning' automata expected
to classify input 'stimuli' based on their past experience on 'training' inputs. Minsky and Papert showed
that multiple stages of perceptrons were required for many problems of interest, yet no training
algorithm was known at the time for multi-level systems. They concluded in their book that the systems
held little promise, and subsequent investigation of perceptrons evaporated.


Eventually however, with more powerful computers to carry out simulations, and the development
of several multi-level learning algorithms [32, 16, 27, 5], descendant offshoots of the perceptron have
regained interest. Currently a variety of these automata exist and are known by names such as
'Neural-nets', 'Parallel Distributed Processors' (PDP networks), and 'Associative Memories'. They are
collectively called 'Connectionist Architectures' and have been studied as self-organizing memories of
perception [21], content-addressable memories, hierarchical knowledge bases, and classification systems
[3, 2], models of human 'neural-computation' [13, 3], of human task performance and attentional
learning [37, 35], and of speech performance and natural language understanding [36, 33, 11].


These and other efforts have led to guarded optimism for the future of connectionist architectures
as knowledge engines or as models of human intelligence. Capabilities and limitations of both task
learning and performance have been demonstrated. However, with the exception of a few mathematical
investigations [21, 13, 14, 5, 12], these structures are understood primarily in a qualitative sense.

In this paper, we utilize concepts from information theory to study a simple matrix model of
distributed memory. Its information-storage capacity and efficiency are evaluated, allowing definition
of a matrix's storage load factor. Memory performance in problems such as pattern completion can then be
viewed as a function of matrix loading. Degradation of storage capacity with inter-stimulus correlation
and noise at the input are also addressed.

This work is motivated by a simulation model of human attentional learning developed by the
authors [35]. Though these results are specifically intended for fuller understanding of the model, they
apply to a much broader class of 'connectionist' systems.


'Neural-based' systems

Matrix models of parallel distributed memories were derived as a simplistic model of brain cell
computation. In the model, the output of each cell is a real number, $y$, representing the deviation of
the cell's firing frequency from some reference frequency. As such, $y$ can be negative as well as
positive. The inputs $\{x_1, x_2, \ldots, x_n\}$ to the cell are similarly real valued, and each input
$x_i$ has an associated coupling strength $w_i$ to the cell which determines the effectiveness of that
input on the cell output. The cell determines its output by taking the weighted average of the inputs:

$$y = \frac{1}{n} \sum_{i=1}^{n} w_i x_i$$


The matrix memory is constructed from a collection of these cells, each sampling the same set of inputs.
If $n_I$ is the number of inputs to the memory and $n_O$ is the number of cells in the memory, the vector
$x \equiv (x_1, x_2, \ldots, x_{n_I})$ of inputs, when presented to the input of the system, produces an
output vector $y \equiv (y_1, y_2, \ldots, y_{n_O})$ given by the relation:

$$y = \frac{1}{n_I} A x$$

where $A$ is the matrix of coupling weights $w_{ji}$ connecting the ith input to the jth cell [15, 21].


To store information in this system, two sets of vectors called the input prototypes
$\{f_1, f_2, \ldots, f_M\}$ and the output prototypes $\{g_1, g_2, \ldots, g_M\}$ are used. For each input
prototype $f_m$, the weights of the system are adjusted so that the $g_m$ vector results at the system
output when $f_m$ is presented at the input. The system is then said to associate $f_m$ with $g_m$. For
each $m = 1, 2, \ldots, M$, the matrix that is used to associate $f_m$ with $g_m$ (called the mth
association) is the outer product $g_m f_m^T$ [15, p. 18]. To store the M associations, these M matrices
are added to obtain:

$$A = \sum_{m=1}^{M} g_m f_m^T \tag{1}$$

The information for each association is distributed over the whole of $A$ and therefore is overlaid with
the information for the other associations. The resulting interference between associations increases
with M, and ultimately limits the number of associations storable in the system.


In the case that $f_1, f_2, \ldots, f_M$ are mutually orthogonal, no interference exists. When $f_k$ is
input to the system, we have²

$$\frac{1}{n_I} A f_k = \frac{1}{n_I} \sum_{m=1}^{M} g_m f_m^T f_k
  = \frac{1}{n_I} (f_k^T f_k)\, g_k
  = \frac{1}{n_I} \|f_k\|^2 g_k, \qquad k = 1, 2, \ldots, M.$$

The matrix produces a multiple of $g_k$ when $f_k$ is present at the input. If the $f_k$ are chosen so
that $\|f_k\|^2 = n_I$, then $g_k$ is reproduced exactly [3, p. 804].
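
The orthogonal case can be checked with a small sketch. It assumes four mutually orthogonal ±1 input prototypes taken from the rows of a 4x4 Hadamard matrix and arbitrary ±1 output prototypes; both choices, and the helper names `store` and `recall`, are illustrative and do not come from the text. Since each input prototype has squared norm equal to $n_I$, every recalled vector should equal its output prototype.

```python
def store(inputs, outputs):
    """Sum the outer products g_m f_m^T, as in equation (1)."""
    n_i, n_o = len(inputs[0]), len(outputs[0])
    a = [[0] * n_i for _ in range(n_o)]
    for f, g in zip(inputs, outputs):
        for j in range(n_o):
            for i in range(n_i):
                a[j][i] += g[j] * f[i]
    return a


def recall(a, x):
    """Return y = (1/n_I) A x."""
    n_i = len(x)
    return [sum(w * xi for w, xi in zip(row, x)) / n_i for row in a]


# Rows of a 4x4 Hadamard matrix: mutually orthogonal, squared norm 4 = n_I.
f_protos = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
# Arbitrary +/-1 output prototypes with n_O = 3 (illustrative values).
g_protos = [[1, -1, 1], [-1, -1, 1], [1, 1, -1], [-1, 1, 1]]

A = store(f_protos, g_protos)
recalled = [recall(A, f) for f in f_protos]
print(recalled == g_protos)
```

Replacing the Hadamard rows with random ±1 vectors would make the cross terms $f_m^T f_k$ nonzero, which is the interference the following sections quantify.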


The synopsis is concerned primarily with the case that $M > n_I$, so that the input vectors are linearly
dependent and interference effects must be accounted for. In this case the output vector is only an
approximation of the proper prototype output. Our concern is the number M of associations that can be
stored in a matrix of a given size before the output becomes unrecognizable.

Characterizing  Storage  Capacity 

To estimate the storage capacity of the matrix, we examine a system that has stored M associations
$(f_m, g_m)$, $m = 1, 2, \ldots, M$ for some M. The input-prototype vectors are $n_I$-dimensional and the
output-prototypes are $n_O$-dimensional. Initially, the values allowed for the components are ±1. All
input-prototypes will then have $\|f_m\|^2 = n_I$ and all output-prototypes $\|g_m\|^2 = n_O$. Later we
can generalize, but this case is interesting in itself as these values represent saturation extremes of
the inputs/outputs. A value of 1 represents a cell firing at its maximum rate and a value of -1
represents the minimum rate. Storing prototypes of this limited form corresponds to the cells each
producing a 'polarized' response to an input vector that itself is the result of a previous stage of
saturated cells. The vectors are assumed to have an unbiased distribution of ±1's as explained later.


To motivate the method of storage measurement, we make an analogy with digital memory. The
address to the memory can be viewed as an input vector and the retrieved data as the output vector. A
particular address vector and the data vector stored at the address location can be regarded as a
vector-association pair. The number of bits represented by the data vector is the information the system
provides upon performing the input-to-output association. For digital memory, the number of bits
represented is the same as the number of bit-locations in the data vector and so is identical with the
dimensionality of the data vector. Storage is defined as the amount of information per association
multiplied by the number of associations stored in memory. Storage capacity is the maximum storage
the system can provide. In this case, the storage capacity is limited by the number of storage locations
of the memory. Though the dimensionality of both the input and output vectors is specified in advance,
the data items are not. That is, the number of items that can be stored is not determined by what they
are.

² The norm $\|\cdot\|$ refers to the Euclidean norm.

For the matrix memory, the storage is likewise given by the information per association multiplied
by the number of associations stored. The dimensionalities of the input and output prototypes are
specified in advance, but the prototypes themselves are not. For this reason, the storage of the memory
is not defined for a particular matrix but rather for a class of matrices all of the same size. The class
of outer-product matrix-associators of a given size is the set of all matrices that can be generated from
vectors of ±1's via equation (1). An association is not considered to be stored in a particular matrix of
the class unless it is explicitly included in the sum (1) that determines the matrix.

Unlike digital memory, the information per association can be characterized in two ways. The first
is to present, for arbitrary $k \in \{1, 2, \ldots, M\}$, the kth input prototype to the system, and regard
the matrix-output as a probabilistic rendition of the kth output prototype. On the average (over all
matrices of the class), given M, the matrix-output will provide a certain amount of information about the
prototype output, and this is taken as the information provided by the association.

The second method is to consider the matrix as an information channel. The kth input is presented
to the system and an output is generated. The latter is compared with each prototype-output vector via
a similarity measure and the best prototype match is chosen. This is called an output decision. If the
ith output prototype is chosen, an error is identified with $i \ne k$. The probability of error averaged
over the matrix-class is taken as the error probability for the associator as an M-ary symmetric channel.
The average mutual information between the output and input is thus defined. This average is considered
as the information per association.

In either case, the storage is the product of M and the information represented by a single
association. Initially, the storage of the matrix increases proportionally with M. The error probability
increases with M as well, so that the information per association gradually decreases. For some value
$M^*$ of M, the information per association begins to diminish more rapidly than M increases. At this
point, storing more associations decreases the total information storage of the system. The system has
reached its storage capacity.

For the second case, we define for each matrix-size, N, the matrix channel of size N on M
associations. It consists of the ensemble of all possible matrices with $n_I n_O = N$ that can be
constructed from a set of M prototype-pairs $(f_m, g_m)$. Once a set of associations is chosen for
storage, a particular matrix is selected from the ensemble via equation (1). This matrix is deterministic
and therefore is not a channel in the usual sense. The storage for a particular matrix constructed from M
associations is defined as the storage of the matrix channel from which it was selected.

The matrix-channel does not require that the system reconstruct the appropriate output response as
does the first storage characterization. The matrix channel merely selects the best match from among the
M prototype-outputs. Therefore one would expect the number of associations storable in the
matrix-channel to be larger than in the first type of associator. The storage capacity of the
matrix-channel identifies the maximum number of usable³ associations that can be stored. Use of the
channel for input-classification will require storage at some fraction of this maximal figure. Our
objective is to quantify the maximal figure as a function of channel-size and use it to determine memory
requirements for particular classification tasks. For this purpose, the matrix-channel will be considered
in what follows.

Bounds  on  Storage  Capacity 

Assumptions  and  Notation 

This analysis assumes important relative magnitudes among the parameters. We assume
$n_I, n_O > 100$. The number of associations M satisfies $n_i \ll M \ll 2^{n_i}$ for both $i = I$ and
$i = O$.⁴ The upper bound in this case is assumed to exceed M by many orders of magnitude. This assures
that sampling without replacement is virtually identical to sampling with replacement and simplifies the
analysis. An optimal value $M^*$ of M will be shown to exist that is less than the net-size, $n_I n_O$.
Therefore, as long as the net-size is insignificant compared with $2^{n_I}$ and $2^{n_O}$, the assumption
on M is justified.

The vector-prototypes are chosen by independently assigning values ±1 to the components. The
probability that either value is taken is 1/2. Random vectors will be referred to with bold capitals
(e.g. X) whereas specific vector-outcomes are denoted in bold lower-case. For $m = 1, 2, \ldots, M$, the
input-prototypes are $F_m$ and the output-prototypes are $G_m$ when considered as random vectors. The
components of the input vectors will be indexed by 'i' (e.g. $F_{mi}$) and the output vectors will be
indexed by 'j'. The range of i is $1, 2, \ldots, n_I$ and that of j is $1, 2, \ldots, n_O$.

If $X_1, X_2, \ldots, X_n$ are independent identically distributed (i.i.d.) random variables (r.v.'s) on
$\{-1, 1\}$ with $p = P[X_i = 1]$ and $S$ is their sum, then $S$ is binomial with parameters ±1, n, p. We
denote this by $S \sim Bin(\pm 1, n, p)$. Similarly, if $X$ is a normal r.v. with mean $\mu$ and variance
$\sigma^2$, we put $X \sim N(\mu, \sigma)$.
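
As a small check on this notation, the sketch below enumerates the exact distribution of $S \sim Bin(\pm 1, n, p)$. It relies only on the observation that if $k$ of the $n$ variables equal +1 then $S = 2k - n$, which gives mean $n(2p - 1)$ and variance $4np(1-p)$; for $p = 1/2$ these reduce to 0 and $n$. The function names are hypothetical.

```python
from math import comb


def pm1_binomial_pmf(n, p):
    """Exact pmf of S = X_1 + ... + X_n with X_i in {-1, +1}, P[X_i = 1] = p."""
    # If k of the n variables equal +1, then S = k - (n - k) = 2k - n.
    return {2 * k - n: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}


def moments(pmf):
    """Return (mean, variance) of a pmf given as {value: probability}."""
    mean = sum(s * q for s, q in pmf.items())
    var = sum((s - mean) ** 2 * q for s, q in pmf.items())
    return mean, var


mean, var = moments(pm1_binomial_pmf(100, 0.5))
print(mean, var)  # close to 0 and to n = 100
```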

The matrix-associator will be referred to as 'A'. Whether a random matrix or a particular outcome
is being discussed should be clear from context. To be consistent with the 'i, j' indexing of input and
output vectors, 'i' will refer to the column and 'j' to the row of a matrix entry, e.g. $A_{ji}$. We
define the kth matrix-output as

$$g'_k \equiv A f_k \tag{2}$$

(the constant $1/n_I$ is dropped) and write the corresponding random vector as $G'_k$. The dot-product of
the kth matrix-output and the lth prototype-output is $D_{kl}$ with outcome $d_{kl}$.

³ 'Usable' for the purposes of input-classification.

⁴ For positive parameters, "$y \gg x$" indicates that y is minimally 10x and is typically much larger.

Parameters that take values in $\{-1, 1\}$ are referred to as 'bits', with -1 acting as the logical '0'.
Logical operations on these parameters are defined in this context, as are terms 'parity', 'complement'
(logical), etc.

Derivation  of  Storage  Limits 

Given the M input-output prototype-pairs $(f_m, g_m)$, the matrix defined by equation (1) is seen as the
sum of M outer-product matrices. The mth outer-product or association-plane is completely determined
by the $n_I + n_O$ bits of $f_m$ and $g_m$. Its jith component, $m_{ji}$, is the product
$f_{mi}\, g_{mj}$, which takes values in $\{-1, 1\}$. The mth association-plane is not changed if both
$f_m$ and $g_m$ are multiplied by -1. This indicates that the mth plane represents at most
$n_I + n_O - 1$ bits of information. The $n_I + n_O - 1$ entries that make up a particular row and column
are easily seen to be independent, so that $n_I + n_O - 1$ is also the lower bound. In fact, the entries
of the row and column are enough to determine every other entry in the plane. To illustrate, examine the
kth row and lth column and the entry $m_{ji} = f_{mi}\, g_{mj}$. The three entries (bits) $m_{ki}$,
$m_{kl}$ and $m_{jl}$ determine $m_{ji}$ so that the parity of these four numbers is even.⁵ Therefore each
association-plane represents exactly $n_I + n_O - 1$ bits.
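
The claim that one row and one column determine the whole plane can be illustrated with a short sketch. It uses the identity $m_{ji} = m_{ki}\, m_{kl}\, m_{jl}$, which follows from the even-parity relation above; the sizes, the seed, and the function names are arbitrary illustrative choices.

```python
import random


def association_plane(f, g):
    """Outer product with entries m[j][i] = f[i] * g[j]."""
    return [[fi * gj for fi in f] for gj in g]


def rebuild_from_row_and_column(row_k, col_l, l):
    """Rebuild a plane from its row k and column l.

    Uses m[j][i] = m[k][i] * m[k][l] * m[j][l], which holds because the
    product of those four +/-1 entries is always +1 (even parity).
    """
    pivot = row_k[l]  # the shared entry m[k][l]
    return [[row_k[i] * pivot * col_l[j] for i in range(len(row_k))]
            for j in range(len(col_l))]


rng = random.Random(0)
f = [rng.choice((-1, 1)) for _ in range(6)]  # n_I = 6
g = [rng.choice((-1, 1)) for _ in range(4)]  # n_O = 4
plane = association_plane(f, g)

k, l = 2, 3
row_k = plane[k]
col_l = [plane[j][l] for j in range(len(g))]
rebuilt = rebuild_from_row_and_column(row_k, col_l, l)
print(rebuilt == plane)
```

The row supplies $n_I$ entries and the column $n_O$, with one entry shared, which matches the count of $n_I + n_O - 1$ bits.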

When the association-planes are summed, information is lost. Storage is bounded above by the
information contained in the weights (entries) of the associator. An assessment of the matrix entropy
provides a bound on the number of association pairs storable. To begin, it can be shown that the entropy
or self-information of a r.v. $X \sim Bin(\pm 1, n, 1/2)$ is virtually identical to that of a normal r.v.
with variance n. The $A_{ji}$ are $Bin(\pm 1, M, 1/2)$, so each has entropy
$H(A_{ji}) = \frac{1}{2}\log_2 2\pi e M$. An upper bound on the matrix entropy can be obtained by assuming
independence of the individual weights. One multiplies the weight-entropy by the number of weights in the
system to get $H(A) = \frac{1}{2} n_I n_O \log_2 2\pi e M$.
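
The per-weight entropy can be compared with an exact computation. The sketch below sums $-q \log_2 q$ over the exact distribution of a sum of M fair ±1 values and prints it next to $\frac{1}{2}\log_2 2\pi e M$; the function names are hypothetical. One point worth noting, as an observation about the discrete distribution rather than a statement from the text: because the sum takes values on a lattice of spacing 2, its exact entropy comes out about one bit below the normal figure, so the normal figure is safe to use as an upper bound and the difference does not affect the growth with $\log_2 M$.

```python
from math import comb, e, log2, pi


def exact_weight_entropy(m):
    """Entropy in bits of a sum of m independent fair +/-1 values."""
    total = 2 ** m
    probs = (comb(m, k) / total for k in range(m + 1))
    return -sum(q * log2(q) for q in probs if q > 0)


def gaussian_weight_entropy(m):
    """The estimate (1/2) log2(2 pi e M) used in the text."""
    return 0.5 * log2(2 * pi * e * m)


for m in (10, 100, 1000):
    print(m, round(exact_weight_entropy(m), 3), round(gaussian_weight_entropy(m), 3))
```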

For M stored associations, there are $M!$ ways to map the M (distinct) inputs to the M (distinct)
outputs. To produce an output vector for each input prototype that results in a correct output-decision,
the matrix entropy must exceed $\log_2 M! \approx M[\log_2 M - \log_2 e]$.⁶ For $M < M^*$ we must have

$$\frac{1}{2} n_I n_O \log_2 2\pi e M > M[\log_2 M - \log_2 e]$$

which leads to

$$\frac{M}{n_I n_O} < \frac{1}{2} \cdot \frac{\log_2 M + \log_2 2\pi e}{\log_2 M - \log_2 e}$$

Generally we can ignore the term $\log_2 e$, and since $\log_2 2\pi e \approx 4$, an approximate bound is

$$\frac{M}{n_I n_O} < \frac{1}{2} + \frac{2}{\log_2 M}$$

⁵ Therefore exactly half of the 16 conceivable configurations of these four bits are possible.

⁶ For $M > 2.2 \cdot 10^4$, $\log_2 M \gg \log_2 e$ so that the $\log_2 e$ term can be ignored. Even for M
as small as 3000 however, the approximation $\log_2 M! \approx M \log_2 M$ is reasonably accurate.
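
The two forms of the capacity bound, $\frac{1}{2}(\log_2 M + \log_2 2\pi e)/(\log_2 M - \log_2 e)$ and its approximation $\frac{1}{2} + 2/\log_2 M$, can be compared numerically. This is only a sketch for getting a feel for the numbers; the function names and the sample values of M are arbitrary.

```python
from math import e, log2, pi


def exact_ratio_bound(m):
    """Right side of M/(n_I n_O) < (1/2)(log2 M + log2 2 pi e)/(log2 M - log2 e)."""
    return 0.5 * (log2(m) + log2(2 * pi * e)) / (log2(m) - log2(e))


def approx_ratio_bound(m):
    """Approximate bound 1/2 + 2/log2 M (drops log2 e, rounds log2 2 pi e to 4)."""
    return 0.5 + 2.0 / log2(m)


for m in (2 ** 10, 2 ** 16, 2 ** 24):
    print(m, round(exact_ratio_bound(m), 4), round(approx_ratio_bound(m), 4))
```

Both expressions approach 1/2 from above as M grows, consistent with the asymptotic statement that at most half as many associations as weights can be stored.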

For the systems considered, the right side of the inequality will not exceed 1 for M near $M^*$. As the
net-size approaches infinity, $M^*$ is seen to lie beneath one-half the net-size. An important
observation here is that though one row and one column are enough to specify the bits in each
association-plane, the other bits act to preserve information stored in the plane when the planes are
summed together. Without the additional bits, the entropy of the row and column alone becomes
$\frac{1}{2}(n_I + n_O)\log_2 2\pi e M$. This is much smaller than the entropy calculated above and will
serve as a lower bound. The assumption of independent weights is false for individual association-planes
but should be accurate for M near $M^*$, since the inter-correlations between bits in a given plane should
be 'washed out' by 'counter-correlations' in the other (independent) planes in the sum.

Measuring  Similarity 

The output decisions of a matrix-associator depend on the similarity measure used at the output. A
given system will perform differently under different similarity measures. Therefore, the performance of
a system must be defined with respect to a particular similarity measure. The general definition of
similarity measure follows from the Hamming distance function. Defining $\{-1, 1\}^n$ to be
$\{x \in \mathbb{R}^n \mid x_i \in \{-1, 1\},\ i = 1, 2, \ldots, n\}$, the Hamming distance is the function
$HD: \{-1, 1\}^n \times \{-1, 1\}^n \to \mathbb{R}$ given by
$HD(x, y) = \frac{1}{2}\sum_{i=1}^{n} |x_i - y_i|$. The Hamming distance is the number of components at
which x and y disagree. Its negative is a similarity measure on $\{-1, 1\}^n$. If $V$ is an n-dimensional
vector-space, then a similarity measure is a function $S: V \times V \to \mathbb{R}$ such that for
$x, y \in V$:

1. $S(x, y) = S(y, x)$

2. For $x, y \in \{x \in V \mid \|x\| = 1\}$, $S(x, y)$ is maximized by $x = y$.

3. For $x, y, w, z \in \{-1, 1\}^n$, $HD(x, y) < HD(w, z)$ implies $S(x, y) > S(w, z)$.

Under this type of similarity, x and y are said to be more similar than w and z whenever
$S(x, y) > S(w, z)$. The function is maximal for similar vectors. Condition 3 requires the similarity
measure to be consistent with the negative Hamming distance similarity, $-HD(x, y)$, on $\{-1, 1\}^n$.

We allow the word 'maximized' to be replaced by 'minimized' in 2 with the reversal of the
inequality in 3. This results in a function that is minimal for similar vectors. The negative of a
similarity function is therefore also a similarity function.

Examples of similarity measures include those based on Minkowski metrics. For instance, either of
the forms $S(x, y) = \sum_i |x_i - y_i|^p$ or $S(x, y) = \sum_i |x_i + y_i|^p$, $p > 0$, or their
negatives can be used.

An inner-product can also be used, e.g. the dot-product, $S(x, y) = \sum_i x_i y_i$. The dot-product has
several advantages, the first of which is the relative ease of analysis it provides. The dot-product
detection distributions are readily identified. Additionally, the dot-product similarity criterion should
be a good benchmark for the expected performance of systems that are connected to the output of the
matrix-associator. This is because an associator often determines its output by comparing the matrix-input
with its stored input-prototypes via the dot-product similarity measure. The resultant output is
constructed as a weighted sum of the output-prototypes according to how similar their respective
input-prototypes are to the matrix-input. If an associator of this form is connected to the output of a
first-stage matrix-associator, it will function best if the first stage always produces a vector that is
close to the 'correct' input-prototype of the second stage with respect to dot-product similarity.
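
That the dot-product is consistent with the negative Hamming distance on ±1 vectors can be verified directly. The sketch below checks the identity $x \cdot y = n - 2\,HD(x, y)$ on randomly drawn vectors; the vector length, sample size, and seed are arbitrary illustrative choices.

```python
import random


def hamming_distance(x, y):
    """Number of components at which two +/-1 vectors disagree."""
    return sum(abs(a - b) for a, b in zip(x, y)) // 2


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


rng = random.Random(1)
n = 12
vectors = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(50)]

# Each agreement contributes +1 and each disagreement -1 to the dot product,
# so x.y = n - 2 * HD(x, y): a smaller distance always means a larger dot product.
ok = all(dot(x, y) == n - 2 * hamming_distance(x, y)
         for x in vectors for y in vectors)
print(ok)
```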

Detection 

The dot-product will be the subject of the analysis, so that $S(x, y)$ will represent this function.
Detection will consist of placing $f_k$ at the input of the matrix, determining the output $g'_k$ and
calculating $S(g'_k, g_m)$ for $m = 1, 2, \ldots, M$. The value of m for which this quantity is largest will
be chosen as the best match. Since the vectors were originally chosen randomly, the dot-products produced
are random variables. The distribution of $S(G'_k, G_m)$ varies according to whether $m = k$. The
condition $m = k$ is the match condition and defines the match distribution for the system. The
condition $m \ne k$ is the no-match condition defining the no-match distribution. Determination of the
distributions will allow evaluation of the probability of an incorrect output-decision.


The dot product is $D_{kl} \equiv G'_k \cdot G_l$, $k, l = 1, 2, \ldots, M$, where $G'_k = A F_k$ and $A$
is given by (1). More explicitly,

$$G'_k = A F_k = \sum_{m=1}^{M} G_m F_m^T F_k = \sum_{m=1}^{M} (F_m \cdot F_k)\, G_m \tag{3}$$

From this $D_{kl}$ is seen to be

$$D_{kl} = G'_k \cdot G_l = \sum_{m=1}^{M} (F_m \cdot F_k)(G_m \cdot G_l) \tag{4}$$

Since $F_m \cdot F_k = \sum_{i=1}^{n_I} F_{mi} F_{ki}$ and similarly for $G_m \cdot G_l$, the sum for
$D_{kl}$ expands to

$$D_{kl} = \sum_{m=1}^{M} \sum_{i=1}^{n_I} \sum_{j=1}^{n_O} F_{mi} F_{ki} G_{mj} G_{lj} \tag{5}$$

The components of the prototype-vectors are chosen independently over $\{-1, 1\}$ with each of these
two values occurring with probability 1/2. This implies that the terms in (5) are 'bits' by our
definition, so that $D_{kl} \sim Bin(\pm 1, M n_I n_O, 1/2)$ when $k \ne l$. For $k = l$,

$$D_{kk} = n_I n_O + \sum_{m=1,\, m \ne k}^{M} \sum_{i=1}^{n_I} \sum_{j=1}^{n_O} F_{mi} F_{ki} G_{mj} G_{kj} \tag{6}$$

and $D_{kk} - n_I n_O \sim Bin(\pm 1, (M-1) n_I n_O, 1/2)$. For the assumed range of M, $M - 1 \approx M$,
so that $D_{kk}$ and $D_{kl}$ have the same variance, $\sigma_D^2 = M n_I n_O$. The two distributions are
identical except for the difference in the means. The means of the sums in (5) and (6) are zero. The first
term in (6), however, is the constant $n_I n_O$. The match distribution then has mean
$\mu_1 = n_I n_O$ and the no-match distribution has mean $\mu_2 = 0$.


The separation, d', of the two distributions is defined as the absolute difference of the means divided
by the geometric mean of the standard-deviations. Since the same standard-deviation is common to both
distributions, d' is the difference between the means measured in standard-deviation-length units:

$$d' = \frac{|\mu_1 - \mu_2|}{\sigma_D} = \frac{n_I n_O}{\sqrt{M n_I n_O}} = \sqrt{n_I n_O / M} \tag{7}$$

The larger the relative separation between the distributions, the smaller the probability that an
outcome from one distribution will be found near typical outcomes from the other distribution. As we will
see, a large d' will afford a low error-rate. From (7), d' increases with increasing net-size and
decreases with M, as would be expected.⁷
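
The match and no-match distributions and the prediction $d' = \sqrt{n_I n_O / M}$ can be probed with a small simulation. The sketch below is a rough check under illustrative parameters ($n_I = n_O = 16$, $M = 200$, a fixed seed) that are far smaller than the sizes assumed in the analysis; the function names are hypothetical. With sizes this small, $n_I$ and $n_O$ are not negligible next to M, so the measured no-match spread can exceed $\sqrt{M n_I n_O}$ somewhat and the empirical d' may fall a little below the prediction.

```python
import random
from math import sqrt


def simulate_dot_products(n_i, n_o, m, seed=0):
    """Return (match, no_match) lists of D_kk and D_kl (k != l) for one random matrix."""
    rng = random.Random(seed)
    f = [[rng.choice((-1, 1)) for _ in range(n_i)] for _ in range(m)]
    g = [[rng.choice((-1, 1)) for _ in range(n_o)] for _ in range(m)]
    # A = sum_m g_m f_m^T, equation (1).
    a = [[sum(g[t][j] * f[t][i] for t in range(m)) for i in range(n_i)]
         for j in range(n_o)]
    match, no_match = [], []
    for k in range(m):
        # g'_k = A f_k (the constant 1/n_I is dropped), equation (2).
        out = [sum(w * x for w, x in zip(row, f[k])) for row in a]
        for ell in range(m):
            d = sum(o * y for o, y in zip(out, g[ell]))
            (match if ell == k else no_match).append(d)
    return match, no_match


def mean(values):
    return sum(values) / len(values)


def std(values):
    mu = mean(values)
    return sqrt(sum((v - mu) ** 2 for v in values) / len(values))


n_i, n_o, m = 16, 16, 200
match, no_match = simulate_dot_products(n_i, n_o, m)
d_empirical = (mean(match) - mean(no_match)) / std(no_match)
d_predicted = sqrt(n_i * n_o / m)
print(round(mean(match), 1), round(mean(no_match), 1))
print(round(d_empirical, 3), round(d_predicted, 3))
```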


Evaluation  of  Error  Probability 


In order to determine the information storage for a system whose net-size is $n_I n_O$ with M stored
associations, $P_e$ must be determined as a function of M. An error on the kth input, with probability
$P_{e,k}$, occurs if there is an $l \in \{1, 2, \ldots, M\}$, $l \ne k$, such that $D_{kl} > D_{kk}$. The
average over k of $P_{e,k}$ is $P_e$.

Let $R_k$ denote the range of possible values of $D_{kk}$. One minus the probability that an error occurs
is the probability that $D_{kl} < D_{kk}$ for every $l \ne k$, i.e.

$$1 - P_{e,k} = \sum_{a \in R_k} P[D_{k1} < a,\ D_{k2} < a,\ \ldots,\ D_{k,k-1} < a,\ D_{kk} = a,\ D_{k,k+1} < a,\ \ldots,\ D_{kM} < a]$$

⁷ A matrix with a large number of stored associations should poorly discriminate between match vs.
no-match output-prototype vectors.


Since $D_{kl} = G'_k \cdot G_l$ and $D_{kj} = G'_k \cdot G_j$, the two r.v.'s both contain information from
$G'_k$ and are not strictly independent. However, the dot-products are independent given $G'_k$ and they
each provide very little information about its components. We assume then that they are very nearly
independent. This allows approximation of $P_{e,k}$ by

$$P_{e,k} = 1 - \sum_{a \in R_k} P[D_{kk} = a] \prod_{l=1,\, l \ne k}^{M} P[D_{kl} < a].$$

The $D_{kl}$, $k \ne l$, are identically distributed as no-matches, so letting $D_k$ be a r.v. with the
no-match distribution gives⁸

$$P_{e,k} = 1 - \sum_{a \in R_k} P[D_k < a]^{M-1}\, P[D_{kk} = a] \tag{8}$$


If we define $F'$ as the distribution $\mathrm{Bin}(\pm 1,\ M n_I n_O,\ 1/2)$ with mean of zero, and $f'$ as the
corresponding density function, then (8) can be written

$$P_{e,k} = 1 - \sum_{a \in R_k} F'(a)^{M-1}\, f'(a - \mu_1) \tag{9}$$

where the argument to $f'$ must be displaced by the mean $\mu_1$ of $D_{kk}$. The distribution $F'$ can be
"normalized" by dividing all dot-product r.v.'s by $\sigma_D = \sqrt{M n_I n_O}$ to obtain the distribution
$F \sim \mathrm{Bin}(\pm 1/\sqrt{M n_I n_O},\ M n_I n_O,\ 1/2)$ with mean of zero. The error $P_{e,k}$ becomes

$$P_{e,k} = 1 - \sum_{a \in R_k} F(a)^{M-1}\, f(a - d') \tag{10}$$

where $f$ is the density of the normalized distribution $F$.

The expression above is not dependent on $k$, so the average probability $P_e$ that an input will
produce an error at the output is given by equation (10); the matrix-channel has been shown to be $M$-ary
symmetric. If $X$ represents the input vector r.v. and $Y$ the subsequent output vector r.v., then the
information per association is given by

$$I(X;Y) = \log_2 M - P_e \log_2 (M - 1) - H_b(P_e) \tag{11}$$

where $H_b(x) = -x \log_2 x - (1 - x) \log_2 (1 - x)$, $0 < x < 1$, is the binary entropy function.
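Equations (10) and (11) can be evaluated numerically. The sketch below is one possible approach, not the authors' procedure: it assumes the normalized distribution $F$ is replaced by a standard normal, so the sum in (10) becomes an integral computed with the trapezoid rule. The function names, the integration window of eight standard deviations, and the step count are all illustrative choices.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # stand-in for the normalized distribution F (an assumption)

def binary_entropy(x: float) -> float:
    """H_b(x) in bits; taken as 0 at the endpoints x = 0 and x = 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1.0 - x) * math.log2(1.0 - x)

def error_probability(d: float, m: int, steps: int = 4000) -> float:
    """Normal-approximation version of eq. (10): 1 - integral of F(a)^(M-1) f(a - d')."""
    lo, hi = d - 8.0, d + 8.0              # window around the match mean d'
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        a = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0   # trapezoid end-point weights
        total += weight * STD_NORMAL.cdf(a) ** (m - 1) * STD_NORMAL.pdf(a - d)
    return min(1.0, max(0.0, 1.0 - total * h))

def information_per_association(p_e: float, m: int) -> float:
    """Eq. (11) for an M-ary symmetric channel with error probability p_e."""
    if m < 2:
        return 0.0
    return math.log2(m) - p_e * math.log2(m - 1) - binary_entropy(p_e)

n_in = n_out = 32
for m in (8, 64, 512):
    d = math.sqrt(n_in * n_out / m)        # d' from eq. (7)
    p_e = error_probability(d, m)
    print(m, p_e, m * information_per_association(p_e, m))
```

With $d' = 0$ the match is indistinguishable from the $M - 1$ no-matches, so the error probability is $1 - 1/M$ and the information per association is zero, which gives a convenient sanity check.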

For a given matrix-class, we evaluate the storage $M \cdot I(X;Y)$ for increasing $M$ until the maximum
storage is found. The maximum is called the storage capacity of the net. The value of $M$ that
produces the maximum is called the storage addressability of the system under this storage
characterization. Uniqueness of this value depends upon the nature of $I(X;Y)$ as a function of $M$. This
function is plotted in figure for a net-size of 10s'5. The throughput addressability is the value M° of $M$ at
which the maximum is achieved. The function is believed to be unimodal, increasing to a maximum
before $M$ reaches the net-size and then decreasing rapidly thereafter. The storage should reach a
maximum before $M$ reaches the bound given by equation and remain low for larger $M$ as long as
$M < 2^{n_I}$ and $M < 2^{n_O}$ are satisfied. The numerical analysis carried out to date bears this out.
However, a normal approximation to the distribution $F$ in equation (10) was used and is highly inaccurate
for large $M$. Presently, a more accurate approximation is being devised [20]. Numerical methods based on
the new approximation and actual simulations of associator matrices will be used to determine storage of
the systems and the validity of the analysis.

8 Since $P(D_k = a) \approx 0$, no distinction between $P(D_k < a)$ and $P(D_k \le a)$ is made.

Data-Dependence  of  Capacity 

In the foregoing development, we assumed the vector-prototypes were chosen randomly. Random
vectors' tendency toward pairwise orthogonality keeps interference among associations low. Subsequent
sections examine suboptimal prototype storage and retrieval. The object will be to characterize
deleterious effects of storing low-entropy associations.


Storage  Efficiency 

Storage efficiency of a matrix is the matrix-storage divided by the information required to represent
a matrix associator on $M$ associations. Examination of equation (1) reveals that each entry in an
associator matrix is the sum of $M$ bits (each $\pm 1$). The range of values of each entry is the integers
between $-M$ and $M$. The extremes are realized whenever the bits for that entry all agree in value. Further,
the entry will be even if and only if $M$ is even. It follows that the number of values an entry can assume
is $M + 1$. This means that $n_I n_O$ weights will require $n_I n_O \log_2 (M + 1) \approx n_I n_O \log_2 M$ bits
for storage. Letting $E = M \cdot I(X;Y)$ be the storage of the net, we define the efficiency $\eta$ by

$$\eta = \frac{E}{n_I n_O \log_2 M}.$$

Since $I(X;Y) \le \log_2 M$ by equation (11), it follows that $E \le M \log_2 M$ and we have

$$\eta \le \frac{M \log_2 M}{n_I n_O \log_2 M} = \frac{M}{n_I n_O}.$$
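The counting argument and the resulting bound can be illustrated with a few lines of code. This is a minimal sketch under the definitions in this section; the helper names are hypothetical, and the storage $E$ is simply set to its largest possible value $M \log_2 M$ to show that the efficiency then equals $M / (n_I n_O)$.

```python
import math

def possible_weight_values(m):
    """All values a weight can take as a sum of m terms, each +1 or -1."""
    return list(range(-m, m + 1, 2))

def efficiency(storage_bits, n_in, n_out, m):
    """eta = E / (n_I * n_O * log2 M), the first efficiency definition."""
    return storage_bits / (n_in * n_out * math.log2(m))

m, n_in, n_out = 64, 32, 32
assert len(possible_weight_values(m)) == m + 1   # M + 1 distinct values per weight

best_case_storage = m * math.log2(m)             # E cannot exceed M * log2 M
print(efficiency(best_case_storage, n_in, n_out, m), m / (n_in * n_out))
```

Any realistic $E$, obtained with a nonzero error probability, gives a smaller efficiency than this best case.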

From equation , the bound becomes

$$\eta < \frac{1}{2} + \frac{1}{\log_2 M}.$$

If one could take advantage of the fact that each weight has entropy $\tfrac{1}{2} \log_2 2\pi e M$, the
information required to implement the matrix becomes $\tfrac{1}{2} n_I n_O \log_2 2\pi e M$ as stated earlier.
One could therefore define the efficiency by

$$\eta = \frac{E}{\tfrac{1}{2} n_I n_O \log_2 2\pi e M}.$$

Equation stipulates the efficiency defined this way is less than unity.

The second of these efficiency definitions might be most appropriate if $\tfrac{1}{2} n_I n_O \log_2 2\pi e M$
were the maximum achievable entropy of the weights. However, a method for achieving the matrix entropy
$n_I n_O \log_2 (M + 1)$ is being formulated through judicious choice of the associations to be learned. If
successful, the maximum storage possible for a matrix would be shown to be $n_I n_O \log_2 M$. The first
definition of efficiency would then indicate the relation of random storage to optimal storage.

Sensitivity  of  Storage  to  Vector  Correlation 

Previously the vector-components of the prototype-vectors were independently selected from $\{-1, 1\}$
with probability 1/2 that either value was taken. If a bias is made in choosing the vectors so that the
probability that the value 1 occurs is $p$ for each vector component, then the storage capacity is adversely
affected. In this sense, the unbiased selection was optimal. Two questions are important for the
consideration of biased vector selection:

1. What does bias cost in terms of reduced memory capacity?

2. How nearly unbiased must the selection process be in order for the matrix to perform nearly
optimally?

The  first  question  addresses  the  severity  of  memory  degradation  with  bias.  The  second  relates  to  the 
practicality  of  achieving  near  optimal  storage. 

The analysis reveals capacity degradation as a consequence of reduced $d'$ due to bias-induced
vector-correlation. The bias, $\Delta$, is defined as $\Delta = |p - 1/2|$ where the bias-probability $p$ is the
probability that any vector component is assigned the value 1. The input may be selected with a different
bias than the output, so we let $p_F$ be the bias-probability for the input prototypes and $p_G$ be the
bias-probability for the output.

To see how bias affects vector correlation, let $U$ and $V$ be $n$-dimensional vectors on $\{-1, 1\}$ with
bias-probability $p$. When the components are chosen independently, the probability that a component of
$U$ will agree with its counterpart in $V$ is

$$\begin{aligned}
P(U_i = V_i) &= P(U_i = 1, V_i = 1) + P(U_i = -1, V_i = -1), \qquad i = 1, 2, \ldots, n \\
&= P(U_i = 1) P(V_i = 1) + P(U_i = -1) P(V_i = -1) \\
&= p^2 + (1 - p)^2 \\
&= 2 (p - 1/2)^2 + 1/2 \\
&= 1/2 + 2 \Delta^2
\end{aligned} \tag{12}$$

So $P(U_i = V_i) \ge 1/2$ with equality when $p$ is 1/2 ($\Delta = 0$).
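The identity in (12) is easy to confirm numerically. The following sketch compares the direct expression $p^2 + (1 - p)^2$ with $1/2 + 2\Delta^2$ for a few bias-probabilities; the function names are illustrative.

```python
def agreement_probability(p):
    """Probability that two independent components with P(+1) = p agree."""
    return p * p + (1.0 - p) * (1.0 - p)

def bias(p):
    """Bias as defined in the text: distance of p from 1/2."""
    return abs(p - 0.5)

for p in (0.5, 0.6, 0.75, 0.9):
    direct = agreement_probability(p)
    via_bias = 0.5 + 2.0 * bias(p) ** 2    # right-hand side of eq. (12)
    print(p, direct, via_bias)
```

The two columns coincide, and the agreement probability is smallest, at 1/2, when $p = 1/2$.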

We define $\rho_F = p_F^2 + (1 - p_F)^2$ to be the probability that $F_{m,i} = F_{m',i}$ for arbitrary
$m, m' \in \{1, 2, \ldots, M\}$ and $i \in \{1, 2, \ldots, n_I\}$. We say that the input prototypes are
$\rho_F$-correlated.

Similarly, the parameter $\rho_G$ represents the correlation between pairs of output prototypes. The $d'$
parameter can be evaluated for the system by determining the mean and variance of both the match and
no-match distributions. The derivation of these is tedious and non-informative and so will be left to an
appendix. Results pertinent to the discussion will be related here. For the match distribution, the mean
$\mu_1$ and variance $\sigma_1^2$ are9

$$\begin{aligned}
\mu_1 &= n_I n_O \left[ 1 + (M - 1)(2\rho_F - 1)(2\rho_G - 1) \right] \\
\sigma_1^2 &= n_I n_O M \left[ 1 + M (2\rho_F - 1)(2\rho_G - 1) \right] \left[ 1 - (2\rho_F - 1)(2\rho_G - 1) \right]
\end{aligned} \tag{13}$$

The no-match parameters are

$$\begin{aligned}
\mu_2 &= n_I n_O \left[ (2\rho_F - 1) + (2\rho_G - 1) + (M - 2)(2\rho_F - 1)(2\rho_G - 1) \right] \\
\sigma_2^2 &= n_I n_O M \left[ 1 + M (2\rho_F - 1)(2\rho_G - 1) \left( 1 - (2\rho_F - 1)(2\rho_G - 1) \right) \right]
\end{aligned} \tag{14}$$

If $\rho_F$ and $\rho_G$ are set to 1/2 in the above equations, the mean and variance assume the values for
the unbiased distributions considered earlier. On the other hand, if each bias is large enough (but not too
close to 1) for the relation

$$M (2\rho_F - 1)(2\rho_G - 1) \left( 1 - (2\rho_F - 1)(2\rho_G - 1) \right) \gg 1 \tag{15}$$

to hold, then both the match and no-match variances can be approximated by
$n_I n_O M^2 (2\rho_F - 1)(2\rho_G - 1)(1 - (2\rho_F - 1)(2\rho_G - 1))$. The absolute difference between the
means is $4 n_I n_O (\rho_F - 1)(\rho_G - 1)$ so that from the definition of $d'$ in (7), we have10

$$d' = \frac{4 \sqrt{n_I n_O}\, (1 - \rho_F)(1 - \rho_G)}{M \sqrt{(2\rho_F - 1)(2\rho_G - 1) \left( 1 - (2\rho_F - 1)(2\rho_G - 1) \right)}}.$$


Whereas $d'$ varied inversely as $\sqrt{M}$ in the unbiased case, it varies inversely as $M$ when a bias is
present. Therefore, a bias is thought to severely limit the capacity of the associator. On the other hand,
a bias must be present on both the input and output vectors for the effect to be present. Correlated
vectors are not as nearly orthogonal as are uncorrelated vectors. Interference effects will not be present if
the associator either maps correlated vectors to nearly orthogonal vectors or vice-versa. In particular, if
correlated input vectors are associated to uncorrelated output vectors, no resulting capacity degradation is
present. An associator could be used as a "front-end" to other associator units in order to translate
correlated input vectors to uncorrelated outputs for further processing.

9 Notice the subtle difference between the match and no-match variances. This is not an error!

10 In this discussion, the correlations are considered as by-products of the bias so that the vector
prototypes can be considered as mutually independent. However, calculation of the match/no-match means
and variances and that of $d'$ was carried out without the assumption of independence between respective
components of the prototype vectors.

In order for correlation effects not to be significant, the bias should be small enough that the
reverse of condition (15) holds. One could ignore $\rho_F$ and $\rho_G$ in (14) and (13) if they satisfied11

$$M (2\rho_F - 1)(2\rho_G - 1) < 1/9.$$

Say, for example, the bias of the input and the output prototypes were the same. We set both $\rho_F$ and
$\rho_G$ equal to $1/2 + 2\Delta^2$ in accordance with (12). From condition , it follows that
$M (2\rho_F - 1)(2\rho_G - 1) \ll 1$. The bias $\Delta$ would have to satisfy
$[2(1/2 + 2\Delta^2) - 1]^2 < 1/(9M)$ so that $\Delta$ cannot exceed

$$\frac{1}{2 \sqrt{3}\, M^{1/4}}.$$

Large associators with many stored associations will require small values of $\Delta$ to perform nearly
optimally. It is the large systems that will suffer substantial capacity deterioration if care is not taken to
ensure that the vector prototypes are chosen with nearly even distribution of $-1$'s and $1$'s.
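To get a feel for how tight this requirement is, the largest tolerable bias can be tabulated against $M$. The sketch assumes equal bias on input and output, $\rho = 1/2 + 2\Delta^2$ from (12), and the threshold $M (2\rho - 1)^2 < 1/9$; solving for $\Delta$ gives $\Delta < (1 / (144 M))^{1/4}$. The function name and the `threshold` parameter are illustrative.

```python
import math

def max_tolerable_bias(m, threshold=1.0 / 9.0):
    """Largest bias with M * (4 * bias**2)**2 <= threshold, using rho = 1/2 + 2*bias**2."""
    return (threshold / (16.0 * m)) ** 0.25

for m in (10, 1000, 100000):
    print(m, round(max_tolerable_bias(m), 4))
```

The tolerance falls only as $M^{-1/4}$, but for large $M$ it still forces the probability of a 1 to stay within a few percent of 1/2.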


When $\Delta$ is large enough to limit performance, it is desirable to substitute $d'$ from equation (18) into
(10) and (11) to estimate the reduced capacity. A large bias, however, will compromise the independence of
the dot-products $D_{kl}$, $k, l \in \{1, 2, \ldots, M\}$, that was assumed for the derivation of (10). At best,
(10) might be accurate for the smallest values of $\Delta$ in the non-optimal range. If we assume $\rho_F$
equals $\rho_G$, then the smallest non-optimal value for $M$ associations is determined from (15) and so must
satisfy

$$M (2\rho_F - 1)(2\rho_G - 1) \gg 1.$$

We take "9" to be much greater than 1 and get

$$\Delta \ge \frac{\sqrt{3}}{2 M^{1/4}}.$$


An upper bound on the capacity may be found by estimating the entropy of the matrix weights,
which will be distributed as $\mathrm{Bin}(\pm 1, n, p)$ where $p$ is determined from $\rho_F$ and $\rho_G$.
Again, only the smallest values of non-optimal $\Delta$ can be considered by this method since the weights
will lose their independence as the bias becomes large.


11 The fraction "1/10" is $\ll 1$, but 9 is a perfect square, so "1/9" is used.